In silico prediction of aqueous solubility – classification models
Authors
Abstract
Solubility is a very important parameter in pharmaceutical research, especially in the early phase of drug discovery, for fully automated high-throughput screening, compound pool extension, and SAR and ADME-Tox parameter measurement. In recent years, a multitude of models for the exact prediction of aqueous solubility has been published. Still, almost all of the tools that have since become commercially available suffer from comparatively poor R²y values when predicting the solubility of pharmaceutically relevant molecules [1]. This might be attributed, first, to poor data quality, as the experimental conditions under which the solubility data published in the literature were obtained differ considerably, and second, to the fact that many compounds with solubility values extracted from the literature are not druglike. But even with high-quality data measured in one lab, the R²y values obtained with the latest high-end algorithms are often unsatisfactory. In a very careful, recently published study, Müller et al. obtained an R²y value of 0.53 with a Gaussian process model on a separate dataset derived from in-house shake-flask experiments [1]. However, knowing the exact value is not really important for many applications; it matters more to know whether a certain compound will be insoluble under the test conditions used and should thus be excluded from the experiment. To address this question, we built classification models based on two datasets measured in-house at Boehringer Ingelheim at pH 7.4: one kinetic set of solubility measurements based on nephelometry and one thermodynamic set of solubility measurements based on shake-flask experiments. The datasets were divided into three classes: a well-soluble class, an insoluble class, and a buffer class in between to compensate for noisy data. For these datasets, we built classification models using support vector machines (SVM) and Bayesian regularized neural networks (BRANN), trying several different descriptor sets.
In each case, MOE2D descriptors combined with an SVM model gave the best raw results, with an overall accuracy of ~70% in triple cross-validation. When the predictions for and of the buffer class are left out, i.e. when only strong outliers are considered, the overall accuracy is ~88.5%. We also evaluated classifier fusion and model applicability domain (MAD) considerations for this dataset. Applying these, we achieved accuracies of ~93% for ~80% of the dataset.
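The three-class scheme and the two accuracy figures can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class thresholds, the function names, and the example values are hypothetical placeholders, and "leaving out the predictions for and of the buffer class" is read here as dropping every compound whose measured or predicted class is the buffer class.

```python
from typing import List, Sequence

SOLUBLE, BUFFER, INSOLUBLE = "soluble", "buffer", "insoluble"


def assign_class(log_s: float, low: float = -5.0, high: float = -4.0) -> str:
    """Bin a solubility value into one of three classes.

    The thresholds `low` and `high` are hypothetical; the abstract does
    not state the cut-offs that were actually used.
    """
    if log_s >= high:
        return SOLUBLE
    if log_s <= low:
        return INSOLUBLE
    return BUFFER


def overall_accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of compounds whose predicted class equals the measured class."""
    hits = sum(t == p for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


def accuracy_without_buffer(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Accuracy over compounds where neither the measured nor the predicted
    class is the buffer class (assumed reading of the abstract)."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred)
             if t != BUFFER and p != BUFFER]
    if not pairs:
        raise ValueError("no compounds left after removing the buffer class")
    return sum(t == p for t, p in pairs) / len(pairs)


# Made-up example values, for illustration only.
measured: List[float] = [-3.0, -4.5, -6.0, -3.5, -5.5, -4.2]
y_true = [assign_class(v) for v in measured]
y_pred = [SOLUBLE, SOLUBLE, INSOLUBLE, BUFFER, SOLUBLE, BUFFER]

acc_all = overall_accuracy(y_true, y_pred)
acc_strict = accuracy_without_buffer(y_true, y_pred)
```

Under this reading, only a soluble compound predicted as insoluble (or the reverse) counts as an error in the second figure, which is why it is higher than the raw three-class accuracy.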